SECTION I

Q1

Read in the gapminder_clean.csv data as a tibble using read_csv.

df <- as_tibble(read.csv("gapminder_clean.csv",
  check.names = FALSE,
  stringsAsFactors = TRUE
)[, -1]) %>%
  mutate_all(~ na_if(., "")) %>%
  mutate(Year = as.factor(Year))

datatable(df, caption = "Loaded dataset")

Variables characterization

vars <- data.frame(
  `Unique values` = sapply(lapply(df, unique), length),
  `NA count` = colSums(is.na(df)), Type = sapply(df, class),
  check.names = FALSE
)

kable(vars, caption = "Columns (variables) in dataset")
Columns (variables) in dataset
Unique values NA count Type
Country Name 263 0 factor
Year 10 0 factor
Agriculture, value added (% of GDP) 1395 1179 numeric
CO2 emissions (metric tons per capita) 2174 414 numeric
Domestic credit provided by financial sector (% of GDP) 1692 864 numeric
Electric power consumption (kWh per capita) 1338 1238 numeric
Energy use (kg of oil equivalent per capita) 1380 1197 numeric
Exports of goods and services (% of GDP) 1771 798 numeric
Fertility rate, total (births per woman) 1939 184 numeric
GDP growth (annual %) 1880 691 numeric
Imports of goods and services (% of GDP) 1771 798 numeric
Industry, value added (% of GDP) 1388 1189 numeric
Inflation, GDP deflator (annual %) 1671 706 numeric
Life expectancy at birth, total (years) 2381 189 numeric
Population density (people per sq. km of land area) 2537 49 numeric
Services, etc., value added (% of GDP) 1388 1186 numeric
pop 1285 1323 numeric
continent 6 1323 factor
gdpPercap 1285 1323 numeric

There are a total of 19 variables in 2607 rows in the dataset. Most of the variables are continuous, while only Country Name, Year and continent are discrete (All discrete variables will be treated as factors in the remaining of this analysis).

Empty strings were converted to NA values. It can be observed that there is an important number of NA values in the dataset across almost all variables. NA values are later filtered for the purpose of specific analyses.

There is a total of 263 countries across 10 specific years spanning from 1962 to 2007 in increments of 5 (e.g. 1962, 1967, 1972, etc), including 6 continents.

Continuous variables distributions

df %>%
  dplyr::select(where(is.numeric)) %>%
  gather() %>%
  filter(!is.na(value)) %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = "free")

Most continuous variables show a somewhat exponential distribution. The exceptions being: Fertility rate, total (births per woman) (somewhat uniform), Life expectancy at birth, total (years) (somewhat normal with a skew), and Services, etc., value added (% of GDP) (normal).

Q2

Filter the data to include only rows where Year is 1962 and then make a scatter plot comparing 'CO2 emissions (metric tons per capita)' and gdpPercap for the filtered data.

df_1962 <- df %>%
  filter(Year == 1962) %>%
  filter(!is.na(gdpPercap) &
    !is.na(`CO2 emissions (metric tons per capita)`)) %>%
  dplyr::select(
    `Country Name`, Year, gdpPercap,
    `CO2 emissions (metric tons per capita)`
  )

kable(summary(df_1962, maxsum = 0), caption = "Summary of variables")
Summary of variables
Country Name Year gdpPercap CO2 emissions (metric tons per capita)
(Other):108 (Other):108 Min. : 355.2 Min. : 0.00848
NA NA 1st Qu.: 1064.7 1st Qu.: 0.15407
NA NA Median : 2510.7 Median : 0.37623
NA NA Mean : 5173.3 Mean : 2.35303
NA NA 3rd Qu.: 6157.1 3rd Qu.: 2.61142
NA NA Max. :95458.1 Max. :42.63712

After filtering for the desired year, there is corresponding data for 108 countries (removing NA values). gdpPercap ranges from 355.2 to 95,458, and showing a fairly linear relationship with CO2 emissions (metric tons per capita) which ranges from 0.008481 to 42.64.

p <- df_1962 %>%
  ggplot(aes(x = gdpPercap, y = `CO2 emissions (metric tons per capita)`)) +
  geom_point(aes(text = `Country Name`)) +
  labs(title = "CO2 emissions vs GDP", x = "GPD Per Capita") +
  theme_bw() +
  scale_y_log10() +
  scale_x_log10() +
  geom_smooth(method = lm, formula = y ~ x)

ggplotly(p)

Q3

On the filtered data, calculate the correlation of 'CO2 emissions (metric tons per capita)' and gdpPercap. What is the correlation and associated p value?

result <- cor.test(df_1962$gdpPercap,
  df_1962$`CO2 emissions (metric tons per capita)`,
  use = "complete.obs"
)

print(result)
## 
##  Pearson's product-moment correlation
## 
## data:  df_1962$gdpPercap and df_1962$`CO2 emissions (metric tons per capita)`
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8934697 0.9489792
## sample estimates:
##       cor 
## 0.9260817

The correlation between said variables is 0.9261 with a p-value of 1.129^{-46} (highly significant), meaning the data is well correlated.

Q4

On the unfiltered data, answer “In what year is the correlation between 'CO2 emissions (metric tons per capita)' and gdpPercap the strongest?” Filter the dataset to that year for the next step…

corr_by_year <- df %>%
  group_by(Year) %>%
  summarize(Correlation = cor(gdpPercap,
    `CO2 emissions (metric tons per capita)`,
    use = "complete.obs"
  )) %>%
  arrange(desc(Correlation))

kable(corr_by_year, caption = "Correlation by year")
Correlation by year
Year Correlation
1967 0.9387918
1962 0.9260817
1972 0.8428986
1982 0.8166384
1987 0.8095531
1992 0.8094316
1997 0.8081396
2002 0.8006421
1977 0.7928336
2007 0.7204169

Correlation between 'CO2 emissions (metric tons per capita)' and gdpPercap is the strongest on year 1967.

Q5

Using plotly, create an interactive scatter plot comparing 'CO2 emissions (metric tons per capita)' and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent. You can easily convert any ggplot plot to a plotly plot using the ggplotly() command.

df_1967 <- df %>%
  filter(Year == 1967) %>%
  filter(!is.na(gdpPercap) &
    !is.na(`CO2 emissions (metric tons per capita)`)) %>%
  dplyr::select(
    `Country Name`, Year, gdpPercap,
    `CO2 emissions (metric tons per capita)`, continent, pop
  )

kable(summary(df_1967, maxsum = 0), caption = "Summary of variables")
Summary of variables
Country Name Year gdpPercap CO2 emissions (metric tons per capita) continent pop
(Other):113 (Other):113 Min. : 349 Min. : 0.0118 (Other):113 Min. : 70787
NA NA 1st Qu.: 1136 1st Qu.: 0.1813 NA 1st Qu.: 2427334
NA NA Median : 2742 Median : 0.5749 NA Median : 5212416
NA NA Mean : 5768 Mean : 2.7235 NA Mean : 25359863
NA NA 3rd Qu.: 7656 3rd Qu.: 4.4315 NA 3rd Qu.: 12607312
NA NA Max. :80895 Max. :43.4283 NA Max. :754550000
p <- df_1967 %>%
  ggplot(aes(x = gdpPercap, y = `CO2 emissions (metric tons per capita)`)) +
  geom_point(aes(text = `Country Name`, col = continent, size = pop)) +
  labs(title = "CO2 emissions vs GDP", x = "GPD Per Capita") +
  theme_bw() +
  scale_y_log10() +
  scale_x_log10() +
  geom_smooth(method = lm, formula = y ~ x)

ggplotly(p)

Corresponding data exist for 113 countries for year 1967. The relationship between 'CO2 emissions (metric tons per capita)' and gdpPercap remains highly linear.

SECTION II

Q1

What is the relationship between continent and 'Energy use (kg of oil equivalent per capita)'? (stats test needed)

df_continent <- df %>%
  filter(!is.na(continent) &
    !is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
  dplyr::select(
    continent,
    `Energy use (kg of oil equivalent per capita)`, pop,
    `Country Name`
  )

kable(summary(df_continent, maxsum = 6)[, 1:3],
  caption = "Summary of variables"
)
Summary of variables
continent Energy use (kg of oil equivalent per capita) pop
: 0 Min. : 9.715 Min. :1.821e+05
Africa :199 1st Qu.: 484.129 1st Qu.:4.437e+06
Americas:188 Median : 995.564 Median :1.015e+07
Asia :185 Mean : 1992.607 Mean :4.389e+07
Europe :256 3rd Qu.: 2895.659 3rd Qu.:3.155e+07
Oceania : 20 Max. :14746.031 Max. :1.319e+09

There is a total of 848 rows without NA values. Observation count is fairly balanced between continents with the exception of Oceania, where there is data for only 20 rows.

Exploring a linear regression

model_base <- lm(
  `Energy use (kg of oil equivalent per capita)` ~ continent,
  df_continent
)

model_base_summary <- summary(model_base)

model_base_summary
## 
## Call:
## lm(formula = `Energy use (kg of oil equivalent per capita)` ~ 
##     continent, data = df_continent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2796.0 -1107.5  -349.1   276.8 12904.4 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          698.5      137.2   5.090 4.42e-07 ***
## continentAmericas   1005.1      196.9   5.105 4.10e-07 ***
## continentAsia       1168.8      197.7   5.911 4.93e-09 ***
## continentEurope     2447.5      183.0  13.377  < 2e-16 ***
## continentOceania    3281.8      454.1   7.227 1.11e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1936 on 843 degrees of freedom
## Multiple R-squared:  0.1963, Adjusted R-squared:  0.1924 
## F-statistic: 51.46 on 4 and 843 DF,  p-value: < 2.2e-16

A linear regression for this data set shows that there are statistically significant differences across all continents (p-values < 0.05) in energy use. Being Oceania the continent with the highest consumption, and the lowest being Africa (which is the base, i.e. just intercept value). The regression coefficients show that, for example Asia consumes, in average, 1169 kg of oil equivalent yearly per person more than Africa, and 164 kg (\(1169-1005\)) more than the Americas.

Overall, adjusted R-squared is 0.1924386 meaning that 19.24% of observed variance in 'Energy use (kg of oil equivalent per capita)' is explained by continent.

plot(model_base)

Looking at the diagnostic plots, the Residuals vs Fitted shows a somewhat horizontal line (with a small valley), hinting that the relationship between the variables is fairly linear but variables could use transformation (explored below), this is supported by the observed separation (points vs dotted line) at the right side of the Normal Q-Q plot.

The Scale-Location plot hints heteroscedasticity (non constant variance), which is not surprising because all the others variables in the data set are left out, including a time variable (Year), which could hint that the mean of 'Energy use (kg of oil equivalent per capita)' does not remain constant over time (which is to be expected).

Finally Residuals vs Leverage plot does not show points beyond Cook’s distance, hinting that outliers are not to be expected (although there is a set of observations with high leverage, which could be further explored).

model_log <- lm(log10(`Energy use (kg of oil equivalent per capita)`) ~
  continent, df_continent)

model_log_summary <- summary(model_log)

model_log_summary
## 
## Call:
## lm(formula = log10(`Energy use (kg of oil equivalent per capita)`) ~ 
##     continent, data = df_continent)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73812 -0.21810 -0.03864  0.17894  1.16710 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.72558    0.02627 103.746  < 2e-16 ***
## continentAmericas  0.27191    0.03769   7.214 1.21e-12 ***
## continentAsia      0.23187    0.03785   6.126 1.38e-09 ***
## continentEurope    0.70072    0.03502  20.007  < 2e-16 ***
## continentOceania   0.85552    0.08694   9.841  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3706 on 843 degrees of freedom
## Multiple R-squared:  0.3583, Adjusted R-squared:  0.3553 
## F-statistic: 117.7 on 4 and 843 DF,  p-value: < 2.2e-16

Log transforming variables

In this case, the only variable that could be transformed is Energy use (kg of oil equivalent per capita), as continent is discrete.

Log transforming energy use improved the fit, yielding an adjusted R-squared of 0.3552871 (a significant increase from 0.1924386).

Overall, this model shows benefits from log transforming Energy use (kg of oil equivalent per capita) as more variance can be explained by continent.

While interpreting the coefficients, as shown below, for the log transformed model it is important to take into account that the unit is in log10 scale, meaning, for example in the case of Oceania, that the log10 of Energy use (kg of oil equivalent per capita) increases by 0.85552 from its base (Africa/the intercept) when the country is in Oceania. As an example, an estimation of the energy use when the country is in Oceania would be: \(10^{(2.72558+0.85552)}\)

plot(model_log)

The better fit is also evidenced by improvement in the diagnostic plots. As can be seen in Residuals vs Fitted showing a more horizontal line with a smaller peak than previously seen, and Normal Q-Q plot showing that the right tail of points follow the dotted line closer.

Scale-Location and Residuals vs Leverage remain similar to what was previously seen.

Non linear relationships

Non linear relationships were left out of this analysis.

ANOVA

df_continent <- df %>%
  filter(!is.na(continent) &
    !is.na(`Energy use (kg of oil equivalent per capita)`))

model_log <- aov(log10(`Energy use (kg of oil equivalent per capita)`) ~
  continent, df_continent)

aov_summary <- summary(model_log)

aov_summary
##              Df Sum Sq Mean Sq F value Pr(>F)    
## continent     4  64.66  16.165   117.7 <2e-16 ***
## Residuals   843 115.79   0.137                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not surprisingly, ANOVA confirms that there are statistical differences between continents for the variable in question, showing a very significant p-value of 9.1902822^{-80} (the interpretation of p-value is explained in more detail in the next analysis).

Plots

p <- df_continent %>%
  ggplot(aes(
    x = continent,
    y = `Energy use (kg of oil equivalent per capita)`
  )) +
  geom_boxplot(aes(col = continent)) +
  labs(title = "Energy use vs continent", x = "GPD Per Capita") +
  scale_y_log10()

ggplotly(p)

Differences in energy use between continents are evidenced visually in the box plot above, showing different Y-axis locations across continents. Africa is the continent with the smallest energy use, followed by Asia and the Americas which are similar among each other, while Europe is most similar with Oceania (albeit Oceania shows less dispersion, driven mostly by its limited number of observations).

It is to be noted that the countries with the max consumption of energy is seen in America and Europe, while the minimum is seen in Africa.

Q2

Is there a significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' in the years after 1990? (stats test needed)

df_1990 <- df %>%
  filter(as.numeric(as.character(Year)) > 1990 &
    continent %in% c("Europe", "Asia") &
    !is.na(`Imports of goods and services (% of GDP)`)) %>%
  dplyr::select(continent, `Imports of goods and services (% of GDP)`, Year)

kable(summary(df_1990), caption = "Summary of variables")
Summary of variables
continent Imports of goods and services (% of GDP) Year
: 0 Min. : 0.07951 2002 :56
Africa : 0 1st Qu.: 27.71081 2007 :56
Americas: 0 Median : 38.77832 1997 :53
Asia : 98 Mean : 44.12648 1992 :47
Europe :114 3rd Qu.: 56.12958 1962 : 0
Oceania : 0 Max. :183.91553 1967 : 0
NA NA (Other): 0

Complete corresponding data exists for 212 rows after 1990.

ANOVA

anova <- aov(`Imports of goods and services (% of GDP)` ~ continent, df_1990)

aov_summary <- summary(anova)

aov_summary
##              Df Sum Sq Mean Sq F value Pr(>F)
## continent     1   1347  1347.2   2.012  0.158
## Residuals   210 140594   669.5

An ANOVA test does not show significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' with a p-value of 0.1575197 for the F statistic. Meaning there is a 15.75% chance that the observed variation is due to random chance, which is usually not enough to reject the null hypothesis (that the samples come from the same population). P-value should be at least lower than 0.05 to reject the null hypothesis with a 95% confidence interval.

Plots

p <- df_1990 %>%
  ggplot(aes(
    x = continent,
    y = `Imports of goods and services (% of GDP)`, col = continent
  )) +
  geom_boxplot()

ggplotly(p)

The box plots visually confirms there are not statistical differences. Both samples have a similar mean and somewhat similar interquartile range, although dispersion is higher in Asia.

Q3

What is the country (or countries) that has the highest 'Population density (people per sq. km of land area)' across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

df_pop <- df %>%
  filter(!is.na(`Population density (people per sq. km of land area)`)) %>%
  group_by(`Country Name`) %>%
  summarize(
    AvgPopDensity =
      mean(`Population density (people per sq. km of land area)`)
  ) %>%
  arrange(desc(AvgPopDensity))

kable(summary(df_pop, maxsum = 0), caption = "Summary of variables")
Summary of variables
Country Name AvgPopDensity
(Other):262 Min. : 0.139
NA 1st Qu.: 20.014
NA Median : 47.132
NA Mean : 261.664
NA 3rd Qu.: 121.271
NA Max. :14732.035
df_pop_top <- df_pop %>%
  top_n(5, wt = AvgPopDensity)

p <- df %>%
  filter(`Country Name` %in% df_pop_top$`Country Name`) %>%
  ggplot(aes(
    y = `Population density (people per sq. km of land area)`,
    x = reorder(
      `Country Name`,
      -`Population density (people per sq. km of land area)`
    ),
    col = `Country Name`
  )) +
  geom_boxplot() +
  labs(title = "Top 5 countries by pop. density", x = "Country") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

top_countries <- paste(df_pop_top[
  1:2,
  "Country Name"
]$`Country Name`, sep = ", ")

ggplotly(p)

Average population density varies greatly (0.1394 to 14,732). The top 2 countries with the highest 'Population density (people per sq. km of land area)' across all years are Macao SAR, China, Monaco.

Q4

What country (or countries) has shown the greatest increase in 'Life expectancy at birth, total (years)' since 1962?

df_life_incr <- df %>%
  group_by(`Country Name`) %>%
  filter(!is.na(`Life expectancy at birth, total (years)`)) %>%
  mutate(
    `Life expectancy 1962` =
      `Life expectancy at birth, total (years)`[match(1962, Year)],
    .keep = "all"
  ) %>%
  filter(!is.na(`Life expectancy 1962`)) %>%
  filter(Year == 2007) %>%
  mutate(
    Increase_LE =
      `Life expectancy at birth, total (years)` /
        `Life expectancy 1962`
  ) %>%
  dplyr::select(
    continent, `Country Name`, `Life expectancy 1962`,
    `Life expectancy at birth, total (years)`, Increase_LE
  ) %>%
  arrange(desc(Increase_LE)) %>%
  ungroup()

kable(summary(df_life_incr), caption = "Summary of variables")
Summary of variables
continent Country Name Life expectancy 1962 Life expectancy at birth, total (years) Increase_LE
: 0 Afghanistan : 1 Min. :28.55 Min. :44.18 Min. :0.8451
Africa : 47 Albania : 1 1st Qu.:44.82 1st Qu.:62.03 1st Qu.:1.1477
Americas: 24 Algeria : 1 Median :54.29 Median :71.34 Median :1.2569
Asia : 25 Angola : 1 Mean :54.26 Mean :68.74 Mean :1.2995
Europe : 29 Antigua and Barbuda: 1 3rd Qu.:64.73 3rd Qu.:75.15 3rd Qu.:1.4345
Oceania : 2 Arab World : 1 Max. :73.72 Max. :82.51 Max. :2.0032
NA’s :109 (Other) :230 NA NA NA

In average, all countries increased life expectancy by ~30% from 1962 to 2007 (shown as a blue dotted line in the plot), however the top countries in this ranking yielded a ~100% (or 2x) increase. Continent data was incomplete so that was left out of the analysis.

The plot below shows the distribution of life expectancy increase (Increase_LE) across countries. Something to notice is that after the top 11th country (China) the rate of decrease starts slowing down (showing an elbow in the plot). For this reason, the top 11 countries were selected as the answer to the question presented, and plotted as a bar chart at the end.

Plots

p <- df_life_incr %>%
  ggplot(aes(y = Increase_LE, x = reorder(`Country Name`, -Increase_LE))) +
  geom_point() +
  labs(title = "Countries by life expectancy increase since 1962") +
  geom_hline(
    yintercept = mean(df_life_incr$Increase_LE),
    linetype = "dotted", color = "blue"
  ) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

ggplotly(p)
p <- df_life_incr %>%
  top_n(11, wt = Increase_LE) %>%
  ggplot(aes(
    y = Increase_LE, x = reorder(`Country Name`, -Increase_LE),
    fill = `Country Name`
  )) +
  geom_bar(stat = "identity") +
  labs(
    title =
      "Top countries by life expectancy increase since 1962",
    x = "Country"
  ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

ggplotly(p)